In this project, I will be performing an unsupervised clustering of data on the customer's records from a groceries firm's database. Customer segmentation is the practice of separating customers into groups that reflect similarities among customers in each cluster. I will divide customers into segments to optimize the significance of each customer to the business. To modify products according to distinct needs and behaviours of the customers. It also helps the business to cater to the concerns of different types of customers.

IMPORTING LIBRARIES

In [1]:
#Importing the Libraries
import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt, numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from matplotlib.colors import ListedColormap
from sklearn import metrics
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")
np.random.seed(42)

LOADING DATA

In [2]:
data = pd.read_csv("Desktop/marketing_campaign.csv", sep = "\t")
print("Number of datapoints:", len(data))
data.head()
Number of datapoints: 2240
Out[2]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 5 0 0 0 0 0 0 3 11 0

5 rows × 29 columns

DATA CLEANING

In [3]:
# Check the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
ID                     2240 non-null int64
Year_Birth             2240 non-null int64
Education              2240 non-null object
Marital_Status         2240 non-null object
Income                 2216 non-null float64
Kidhome                2240 non-null int64
Teenhome               2240 non-null int64
Dt_Customer            2240 non-null object
Recency                2240 non-null int64
MntWines               2240 non-null int64
MntFruits              2240 non-null int64
MntMeatProducts        2240 non-null int64
MntFishProducts        2240 non-null int64
MntSweetProducts       2240 non-null int64
MntGoldProds           2240 non-null int64
NumDealsPurchases      2240 non-null int64
NumWebPurchases        2240 non-null int64
NumCatalogPurchases    2240 non-null int64
NumStorePurchases      2240 non-null int64
NumWebVisitsMonth      2240 non-null int64
AcceptedCmp3           2240 non-null int64
AcceptedCmp4           2240 non-null int64
AcceptedCmp5           2240 non-null int64
AcceptedCmp1           2240 non-null int64
AcceptedCmp2           2240 non-null int64
Complain               2240 non-null int64
Z_CostContact          2240 non-null int64
Z_Revenue              2240 non-null int64
Response               2240 non-null int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
  • There are missing values in income
  • Dt_Customer that indicates the date a customer joined the database is not parsed as DateTime
  • There are some categorical features in our data frame.
In [4]:
#First of all, for the missing values, I am simply going to drop the rows that have missing income values.
data = data.dropna()
len(data)
Out[4]:
2216

In the next step, I am going to create a feature out of "Dt_Customer" that indicates the number of days a customer is registered in the firm's database. However, in order to keep it simple, I am taking this value relative to the most recent customer in the record.

Thus to get the values I must check the newest and oldest recorded dates.

In [5]:
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"])
dates = []
for i in data["Dt_Customer"]:
    i = i.date()
    dates.append(i)
print("the newest record:", max(dates))
print("the oldest record:", max(dates))  
the newest record: 2014-12-06
the oldest record: 2014-12-06

Creating a feature ("Customer_For") of the number of days the customers started to shop in the store relative to the last recorded date.

In [6]:
#Created a feature "Customer_For"
days = []
d1 = max(dates) #taking it to be the newest customer
for i in dates:
    delta = d1 - i
    days.append(delta)
data["Customer_For"] = days
data['Customer_For'] = data.Customer_For.astype(str).map(lambda x: x[:-4])
data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
In [7]:
print("Total categories in the feature Marital_Status:\n", data["Marital_Status"].value_counts(), "\n")
print("Total categories in the feature Education:\n", data["Education"].value_counts())
Total categories in the feature Marital_Status:
 Married     857
Together    573
Single      471
Divorced    232
Widow        76
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64 

Total categories in the feature Education:
 Graduation    1116
PhD            481
Master         365
2n Cycle       200
Basic           54
Name: Education, dtype: int64
  • Extract the "Age" of a customer by the "Year_Birth" indicating the birth year of the respective person.
  • Create another feature "Spent" indicating the total amount spent by the customer in various categories over the span of two years.
  • Create another feature "Living_With" out of "Marital_Status" to extract the living situation of couples.
  • Create a feature "Children" to indicate total children in a household that is, kids and teenagers.
  • To get further clarity of household, Creating feature indicating "Family_Size"
  • Create a feature "Is_Parent" to indicate parenthood status
  • Lastly, I will create three categories in the "Education" by simplifying its value counts.
  • Dropping some of the redundant features
In [8]:
#Age of customer today 
data["Age"] = 2021-data["Year_Birth"]

#Total spendings on various items
data["Spent"] = data["MntWines"]+ data["MntFruits"]+ data["MntMeatProducts"]+ data["MntFishProducts"]+ data["MntSweetProducts"]+ data["MntGoldProds"]

#Deriving living situation by marital status"Alone"
data["Living_With"]=data["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})

#Feature indicating total children living in the household
data["Children"]=data["Kidhome"]+data["Teenhome"]

#Feature for total members in the householde
data["Family_Size"] = data["Living_With"].replace({"Alone": 1, "Partner":2})+ data["Children"]

#Feature pertaining parenthood
data["Is_Parent"] = np.where(data.Children> 0, 1, 0)

#Segmenting education levels in three groups
data["Education"]=data["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})

#For clarity
data=data.rename(columns={"MntWines": "Wines","MntFruits":"Fruits","MntMeatProducts":"Meat","MntFishProducts":"Fish","MntSweetProducts":"Sweets","MntGoldProds":"Gold"})

#Dropping some of the redundant features
to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID", "Customer_For"]
data = data.drop(to_drop, axis=1)
In [9]:
data.describe()
Out[9]:
Income Kidhome Teenhome Recency Wines Fruits Meat Fish Sweets Gold ... AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Spent Children Family_Size Is_Parent
count 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 ... 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000
mean 52247.251354 0.441787 0.505415 49.012635 305.091606 26.356047 166.995939 37.637635 27.028881 43.965253 ... 0.073105 0.064079 0.013538 0.009477 0.150271 52.179603 607.075361 0.947202 2.592509 0.714350
std 25173.076661 0.536896 0.544181 28.948352 337.327920 39.793917 224.283273 54.752082 41.072046 51.815414 ... 0.260367 0.244950 0.115588 0.096907 0.357417 11.985554 602.900476 0.749062 0.905722 0.451825
min 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 25.000000 5.000000 0.000000 1.000000 0.000000
25% 35303.000000 0.000000 0.000000 24.000000 24.000000 2.000000 16.000000 3.000000 1.000000 9.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 44.000000 69.000000 0.000000 2.000000 0.000000
50% 51381.500000 0.000000 0.000000 49.000000 174.500000 8.000000 68.000000 12.000000 8.000000 24.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 51.000000 396.500000 1.000000 3.000000 1.000000
75% 68522.000000 1.000000 1.000000 74.000000 505.000000 33.000000 232.250000 50.000000 33.000000 56.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 62.000000 1048.000000 1.000000 3.000000 1.000000
max 666666.000000 2.000000 2.000000 99.000000 1493.000000 199.000000 1725.000000 259.000000 262.000000 321.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 128.000000 2525.000000 3.000000 5.000000 1.000000

8 rows × 27 columns

The above stats show some discrepancies in mean Income and Age and max Income and age. Do note that max-age is 128 years, As I calculated the age that would be today

In [10]:
#Dropping the outliers by setting a cap on Age and income. 
data = data[(data["Age"] < 90)]
data = data[(data["Income"]<600000)]
print("The total number of data-points after removing the outliers are:", len(data))
The total number of data-points after removing the outliers are: 2212
In [11]:
#correlation matrix
corrmat= data.corr()
plt.figure(figsize=(20,20))  
sns.heatmap(corrmat, annot=True, cmap="YlGnBu", center=0)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b8aa1c3ef0>

DATA PREPROCESSING

  • Label encoding the categorical features
  • Scaling the features using the standard scaler
  • Creating a subset dataframe for dimensionality reduction
In [12]:
# Get list of categorical variables
s = (data.dtypes == "object")
object_cols = list(s[s].index)
print("Categorical variables in the dataset:", object_cols)
Categorical variables in the dataset: ['Education', 'Living_With']
In [13]:
# Label Encoding the object dtypes.
LE=LabelEncoder()
for i in object_cols:
    data[i]=data[[i]].apply(LE.fit_transform)
In [14]:
# Create copy of data
ds = data.copy()
# creating a subset of dataframe by dropping the features on deals accepted and promotions
cols_del = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
ds = ds.drop(cols_del, axis = 1)
#Scaling
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds), columns = ds.columns)
print("All features are now scaled")
All features are now scaled
In [15]:
#Scaled data to be used for reducing the dimensionality
print("Dataframe to be used for further modelling:")
scaled_ds.head()
Dataframe to be used for further modelling:
Out[15]:
Education Income Kidhome Teenhome Recency Wines Fruits Meat Fish Sweets ... NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Age Spent Living_With Children Family_Size Is_Parent
0 -0.893586 0.287105 -0.822754 -0.929699 0.310353 0.977660 1.552041 1.690293 2.453472 1.483713 ... 1.426865 2.503607 -0.555814 0.692181 1.018352 1.676245 -1.349603 -1.264598 -1.758359 -1.581139
1 -0.893586 -0.260882 1.040021 0.908097 -0.380813 -0.872618 -0.637461 -0.718230 -0.651004 -0.634019 ... -1.126420 -0.571340 -1.171160 -0.132545 1.274785 -0.963297 -1.349603 1.404572 0.449070 0.632456
2 -0.893586 0.913196 -0.822754 -0.929699 -0.795514 0.357935 0.570540 -0.178542 1.339513 -0.147184 ... 1.426865 -0.229679 1.290224 -0.544908 0.334530 0.280110 0.740959 -1.264598 -0.654644 -1.581139
3 -0.893586 -1.176114 1.040021 -0.929699 -0.795514 -0.872618 -0.561961 -0.655787 -0.504911 -0.585335 ... -0.761665 -0.913000 -0.555814 0.279818 -1.289547 -0.920135 0.740959 0.069987 0.449070 0.632456
4 0.571657 0.294307 1.040021 -0.929699 1.554453 -0.392257 0.419540 -0.218684 0.152508 -0.001133 ... 0.332600 0.111982 0.059532 -0.132545 -1.033114 -0.307562 0.740959 0.069987 0.449070 0.632456

5 rows × 22 columns

DIMENSIONALITY REDUCTION

  • Dimensionality reduction with PCA
  • Plotting the reduced dataframe
In [16]:
#Initiating PCA to reduce dimentions aka features to 3
pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=(["col1","col2", "col3"]))
PCA_ds.describe().T
Out[16]:
count mean std min 25% 50% 75% max
col1 2212.0 -5.079321e-17 2.877338 -5.940876 -2.552996 -0.776146 2.394880 7.411440
col2 2212.0 1.068063e-16 1.699736 -4.285528 -1.329375 -0.149694 1.244895 6.110742
col3 2212.0 1.540861e-17 1.153616 -2.943620 -0.890900 -0.140871 0.813051 3.965047
In [17]:
#A 3D Projection Of Data In The Reduced Dimension
x =PCA_ds["col1"]
y =PCA_ds["col2"]
z =PCA_ds["col3"]

#To plot
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()

CLUSTERING

  • Elbow Method to determine the number of clusters to be formed
  • Clustering via Agglomerative Clustering
  • Examining the clusters formed via scatter plot
In [18]:
# Quick examination of elbow method to find numbers of clusters to make.
print('Elbow Method to determine the number of clusters to be formed:')
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(PCA_ds)
Elbow_M.show()
Elbow Method to determine the number of clusters to be formed:
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b8ab361710>
In [19]:
#Initiating the Agglomerative Clustering model 
AC = AgglomerativeClustering(n_clusters=4)
# fit model and predict clusters
yhat_AC = AC.fit_predict(PCA_ds)
PCA_ds["Clusters"] = yhat_AC
#Adding the Clusters feature to the orignal dataframe.
data["Clusters"] = yhat_AC
In [20]:
#Plotting the clusters
cm = plt.get_cmap('rainbow')
fig = plt.figure(figsize=(10,8))
ax = plt.subplot(111, projection = "3d")
ax.scatter(x, y, z, s=40, c=PCA_ds["Clusters"], marker='o',cmap=cm)
ax.set_title("The Plot Of The Clusters")
plt.show()

EVALUATING MODELS

In [21]:
#Plotting countplot of clusters
pal=["#0048BA","#FFFF19","#D3212D","#3B7A57"]
plt.figure(figsize=(10, 8))
plt.xlabel('Clusters')
plt.ylabel('count')
pl = sns.countplot(x=data["Clusters"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()
In [27]:
pl = sns.scatterplot(data = data, x = data["Spent"], y = data["Income"], hue = data["Clusters"], palette = pal)
pl.set_title("Cluster's Profile Based On Income And Spending")
plt.legend()
plt.show()

Income vs spending plot shows the clusters pattern

group 0: high spending & average income group 1: high spending & high income group 2: low spending & low income group 3: high spending & low income Next, I will be looking at the detailed distribution of clusters as per the various products in the data. Namely: Wines, Fruits, Meat, Fish, Sweets and Gold

In [29]:
plt.figure()
pl=sns.swarmplot(x=data["Clusters"], y=data["Spent"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=data["Clusters"], y=data["Spent"], palette=pal)
plt.show()
In [30]:
#Creating a feature to get a sum of accepted promotions 
data["Total_Promos"] = data["AcceptedCmp1"]+ data["AcceptedCmp2"]+ data["AcceptedCmp3"]+ data["AcceptedCmp4"]+ data["AcceptedCmp5"]
#Plotting count of total campaign accepted.
plt.figure()
pl = sns.countplot(x=data["Total_Promos"],hue=data["Clusters"], palette= pal)
pl.set_title("Count Of Promotion Accepted")
pl.set_xlabel("Number Of Total Accepted Promotions")
plt.show()
In [31]:
#Plotting the number of deals purchased
plt.figure()
pl=sns.boxenplot(y=data["NumDealsPurchases"],x=data["Clusters"], palette= pal)
pl.set_title("Number of Deals Purchased")
plt.show()

Unlike campaigns, the deals offered did well. It has best outcome with cluster 0 and cluster 3. However, our star customers cluster 1 are not much into the deals. Nothing seems to attract cluster 2 overwhelmingly

PROFILING

In [32]:
Personal = [ "Kidhome","Teenhome", "Age", "Children", "Family_Size", "Is_Parent", "Education","Living_With"]

for i in Personal:
    plt.figure()
    sns.jointplot(x=data[i], y=data["Spent"], hue =data["Clusters"], kind="kde", palette=pal)
    plt.show()
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>

CONCLUSION

In this project, I performed unsupervised clustering. I did use dimensionality reduction followed by agglomerative clustering. I came up with 4 clusters and further used them in profiling customers in clusters according to their family structures and income/spending. This can be used in planning better marketing strategies.